Introduction

Overview and Motivation

Over the last two years, COVID-19 has dominated the news. Yet although around 3% of the world population has had COVID-19, diabetes can be considered an even bigger health problem. According to the International Diabetes Federation (IDF), 463 million adults were living with diabetes in 2019 (around 6-7% of the world population), and this number is forecast to rise to 700 million by 2045. Furthermore, about 90% of diabetes cases are of type 2, which results mainly from lifestyle habits rather than genetics. Both types can be managed, and type 2 can often be prevented, with a healthier diet and more physical activity. Additionally, according to the WHO, low-income countries tend to have higher diabetes prevalence. Living in Europe, we observed that diabetes rates differ considerably from one country to another. We therefore wanted to find out whether these rates are linked to a country’s income, and whether, and how, the nutritional composition of the diet in richer countries differs from that in poorer ones.

Research questions

We therefore set out to answer the following questions:

  1. Do European countries with higher GDPs really have lower diabetes prevalence?

  2. Do European countries with higher GDPs consume fewer calories?

  3. How do the proportions of macronutrients consumed (animal protein, plant protein, fat, carbohydrates) differ between richer and poorer countries?

  4. How do these differences relate to diabetes prevalence in these countries?

  5. What typical diet can be observed in richer countries, and how does it relate to lower diabetes prevalence?

Data

For our research, we used three different datasets. While searching for datasets, we made sure that the years and countries matched for every one of them.

Wrangling and cleaning

Caloric consumption

The first dataset contains information on caloric consumption. We downloaded it from https://ourworldindata.org/diet-compositions. It covers almost every country in the world and describes the average diet in each country from 1961 to 2013.

  • We wanted to focus only on observations from European countries, including Switzerland and the UK, in the 2000s, which is why we filtered the data as follows:
#> # A tibble: 406 x 7
#>    Entity  Code   Year `Calories from animal ~ `Calories from plant ~
#>    <chr>   <chr> <dbl>                   <dbl>                  <dbl>
#>  1 Austria AUT    2000                    259.                   165.
#>  2 Austria AUT    2001                    259.                   168.
#>  3 Austria AUT    2002                    258.                   166.
#>  4 Austria AUT    2003                    240.                   168.
#>  5 Austria AUT    2004                    237.                   165.
#>  6 Austria AUT    2005                    242.                   169.
#>  7 Austria AUT    2006                    236.                   169.
#>  8 Austria AUT    2007                    242.                   173.
#>  9 Austria AUT    2008                    237.                   170.
#> 10 Austria AUT    2009                    240.                   170.
#> # ... with 396 more rows, and 2 more variables:
#> #   Calories from fat (FAO (2017)) <dbl>,
#> #   Calories from carbohydrates (FAO (2017)) <dbl>
  • We used the ISO code as the key because it is standardized worldwide and, unlike country names, does not risk being written differently in different tables.

  • We then focused on these variables:

    • Entity: name of the country
    • Code: ISO country code
    • Calories from animal protein (FAO (2017)): the average per capita supply of calories derived from animal protein, measured in kilocalories per person per day
    • Calories from plant protein (FAO (2017)): the average per capita supply of calories derived from plant protein, measured in kilocalories per person per day
    • Calories from fat (FAO (2017)): the average per capita supply of calories derived from fat, measured in kilocalories per person per day
    • Calories from carbohydrates (FAO (2017)): the average per capita supply of calories derived from carbohydrates, measured in kilocalories per person per day
  • The intake of specific macronutrients (carbohydrates, protein and fat) is derived from average food composition factors, which are presented in the Food and Agriculture Organization’s (FAO) Food Balance Sheet Handbook (https://www.fao.org/faostat/en/#data).

  • We then computed the mean consumption of each type of macronutrient in each country over the years 2000 to 2013 and stored these means in a new table. To answer our second research question and to get a broader view of overall intake, we also added the total calories per person per day for each country, computed as the sum of these means.

  • Our assumption was that a country’s wealth may fluctuate over the course of these years (e.g. a dip during the 2008 economic crisis), but that an overall mean is sufficient to compare the different countries and their wealth.

  • With these new numbers, we created a dataframe and named the columns accordingly:

    • country_code: ISO country code
    • cal_prot_animal: the mean of the calories from animal protein consumed in each country in the years 2000-2013
    • cal_prot_plant: the mean of the calories from plant protein consumed in each country in the years 2000-2013
    • cal_carbs: the mean of the calories from carbohydrates consumed in each country in the years 2000-2013
    • cal_fat: the mean of the calories from fat consumed in each country in the years 2000-2013
    • total_consumption: the total consumption, computed from the means of each type of macronutrient consumed in each country in the years 2000-2013
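
The filtering and averaging steps above can be sketched roughly as follows. This is a minimal illustration with dplyr rather than our exact code: the rows in diet_raw are made up, and european_codes is a hypothetical stand-in for the full list of 29 ISO codes.

```r
library(dplyr)

# Toy stand-in for the downloaded table; every calorie value here is made up.
diet_raw <- tibble(
  Entity = c("Austria", "Austria", "Austria", "Brazil"),
  Code   = c("AUT", "AUT", "AUT", "BRA"),
  Year   = c(1999, 2000, 2001, 2000),
  `Calories from animal protein (FAO (2017))` = c(255, 260, 258, 150),
  `Calories from plant protein (FAO (2017))`  = c(160, 165, 167, 180),
  `Calories from fat (FAO (2017))`            = c(1390, 1400, 1420, 900),
  `Calories from carbohydrates (FAO (2017))`  = c(1810, 1800, 1790, 1700)
)

# Hypothetical subset of the ISO codes to keep.
european_codes <- c("AUT", "CHE", "GBR")

diet_means <- diet_raw %>%
  # Keep European countries and the years shared by all three datasets.
  filter(Code %in% european_codes, Year >= 2000, Year <= 2013) %>%
  group_by(country_code = Code) %>%
  summarise(
    cal_prot_animal = mean(`Calories from animal protein (FAO (2017))`),
    cal_prot_plant  = mean(`Calories from plant protein (FAO (2017))`),
    cal_carbs       = mean(`Calories from carbohydrates (FAO (2017))`),
    cal_fat         = mean(`Calories from fat (FAO (2017))`)
  ) %>%
  # The total is the sum of the four per-country means.
  mutate(total_consumption = cal_prot_animal + cal_prot_plant + cal_carbs + cal_fat)

print(diet_means)
```

Filtering on the ISO code rather than on Entity avoids mismatches caused by differently spelled country names.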

GDP

Our second dataset, downloaded from https://data.worldbank.org, gives the GDP of many countries for each year from 1960 to 2020.

  • It is composed of 266 observations of 65 variables:

    • Country Name: name of the country
    • Country Code: ISO country code
    • Indicator Name: equal to “GDP in current US$” for every row
    • Indicator Code: equal to “NY.GDP.MKTP.CD” for every row
    • one variable for each year from 1960 to 2020
  • As we can see below, RStudio imported the Excel file as is, so the actual column names ended up in the 3rd row and columns 3 to 65 were given numbered placeholder names instead.

#> # A tibble: 269 x 65
#>    `Data Source` `World Developm~ ...3  ...4  ...5  ...6  ...7  ...8 
#>    <chr>         <chr>            <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 Last Updated~ 44454            <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  2 <NA>          <NA>             <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  3 Country Name  Country Code     Indi~ Indi~ 1960  1961  1962  1963 
#>  4 Aruba         ABW              GDP ~ NY.G~ <NA>  <NA>  <NA>  <NA> 
#>  5 Africa Easte~ AFE              GDP ~ NY.G~ 1929~ 1970~ 2147~ 2570~
#>  6 Afghanistan   AFG              GDP ~ NY.G~ 5377~ 5488~ 5466~ 7511~
#>  7 Africa Weste~ AFW              GDP ~ NY.G~ 1040~ 1113~ 1194~ 1268~
#>  8 Angola        AGO              GDP ~ NY.G~ <NA>  <NA>  <NA>  <NA> 
#>  9 Albania       ALB              GDP ~ NY.G~ <NA>  <NA>  <NA>  <NA> 
#> 10 Andorra       AND              GDP ~ NY.G~ <NA>  <NA>  <NA>  <NA> 
#> # ... with 259 more rows, and 57 more variables: ...9 <chr>,
#> #   ...10 <chr>, ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>,
#> #   ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>,
#> #   ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>,
#> #   ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>, ...29 <chr>,
#> #   ...30 <chr>, ...31 <chr>, ...32 <chr>, ...33 <chr>, ...34 <chr>,
#> #   ...35 <chr>, ...36 <chr>, ...37 <chr>, ...38 <chr>, ...
  • We decided to fix that and to keep only the years that are of interest to us and that are shared with the other tables, namely 2000-2013. We also dropped the ‘Indicator Name’ and ‘Indicator Code’ variables, since their values are the same for every row and provide no additional information.

  • Next, we kept only the European countries, just like in the first table.

  • In order to join the tables easily, we gathered the columns corresponding to the different years into a single “year” column, so that each row of the dataset holds the GDP of one country in one year.

  • To make the data easier to manipulate, we renamed the variables of this table as well. We also made sure that the GDP variable was stored as numeric rather than as character, which it was by default. To keep the graphs in the exploratory data analysis easy to read, we also divided the GDP values by one billion.

  • Lastly, we computed the average GDP of each country over the years 2000-2013 in order to be able to plot the different variables together.

  • We now have a dataframe with the following variables:

    • country_name: name of the country
    • country_code: ISO code of the country
    • avg_gdp: the average GDP of a country over 2000-2013, in billions of current US$
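
As a rough sketch of the GDP steps listed above (promoting the third row to column names, keeping the years 2000-2013, reshaping to one row per country and year, converting to numeric, and averaging), one could write something like the following. The table gdp_raw is a small made-up stand-in for the imported sheet, the col1 to col7 names replace the numbered placeholder names, and european_codes is a hypothetical subset of the ISO codes. Skipping the metadata rows at import time would be an alternative to the header fix shown here.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the imported sheet: two metadata rows, then the real
# header in row 3. All columns are character, as in the import shown above.
gdp_raw <- tibble(
  col1 = c("Last Updated Date", NA, "Country Name", "Austria", "Belgium", "Brazil"),
  col2 = c("44454", NA, "Country Code", "AUT", "BEL", "BRA"),
  col3 = c(NA, NA, "Indicator Name", rep("GDP (current US$)", 3)),
  col4 = c(NA, NA, "Indicator Code", rep("NY.GDP.MKTP.CD", 3)),
  col5 = c(NA, NA, "1999", "1.9e11", "2.4e11", "6.0e11"),
  col6 = c(NA, NA, "2000", "2.0e11", "2.5e11", "6.5e11"),
  col7 = c(NA, NA, "2001", "3.0e11", "3.5e11", "5.6e11")
)

# Hypothetical subset of the ISO codes to keep.
european_codes <- c("AUT", "BEL")

# Promote row 3 to column names and drop the three leading rows.
header <- unlist(gdp_raw[3, ], use.names = FALSE)
gdp <- gdp_raw %>%
  slice(-(1:3)) %>%
  setNames(header)

gdp_avg <- gdp %>%
  # Rename the keys, keep only the year columns that exist in 2000-2013.
  select(country_name = `Country Name`, country_code = `Country Code`,
         any_of(as.character(2000:2013))) %>%
  filter(country_code %in% european_codes) %>%
  # One row per country and year.
  pivot_longer(cols = -c(country_name, country_code),
               names_to = "year", values_to = "gdp") %>%
  # Character to numeric, then express GDP in billions.
  mutate(year = as.integer(year), gdp = as.numeric(gdp) / 1e9) %>%
  group_by(country_name, country_code) %>%
  summarise(avg_gdp = mean(gdp, na.rm = TRUE), .groups = "drop")

print(gdp_avg)
```

Using any_of() rather than all_of() keeps the sketch from failing when some of the years are absent, as in this toy table.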

Diabetes

The third dataset, from https://www.ncdrisc.org/data-downloads-diabetes.html, gives the age-standardised diabetes prevalence for each country and sex from 1980 to 2014.

  • It is composed of 14,000 observations of 7 variables:

    • Country/Region/World: name of the country
    • ISO: ISO country code
    • Sex: sex for which the diabetes prevalence is measured in a given country and year
    • Year: year of observation (1980-2014)
    • Age-standardised diabetes prevalence: diabetes rate, standardised for age
    • Lower 95% uncertainty interval: lower limit of the uncertainty interval for the diabetes rate
    • Upper 95% uncertainty interval: upper limit of the uncertainty interval for the diabetes rate
  • We filtered the data to keep only observations between 2000 and 2013 (the interval common to our three datasets).

  • We kept only European countries, as in our first two datasets:

#> # A tibble: 870 x 7
#>    `Country/Region/World` ISO   Sex    Year `Age-standardised diabet~
#>    <chr>                  <chr> <chr> <dbl>                     <dbl>
#>  1 Austria                AUT   Men    2000                    0.0514
#>  2 Austria                AUT   Men    2001                    0.0520
#>  3 Austria                AUT   Men    2002                    0.0525
#>  4 Austria                AUT   Men    2003                    0.0529
#>  5 Austria                AUT   Men    2004                    0.0532
#>  6 Austria                AUT   Men    2005                    0.0534
#>  7 Austria                AUT   Men    2006                    0.0535
#>  8 Austria                AUT   Men    2007                    0.0536
#>  9 Austria                AUT   Men    2008                    0.0536
#> 10 Austria                AUT   Men    2009                    0.0535
#> # ... with 860 more rows, and 2 more variables:
#> #   Lower 95% uncertainty interval <dbl>,
#> #   Upper 95% uncertainty interval <dbl>
  • We will not use the 95% uncertainty interval in our plots.

  • We then separated the dataset into two subsets. One with data about men:

#> # A tibble: 435 x 5
#>    country ISO   sex    year prop_diabetes
#>    <chr>   <chr> <chr> <dbl>         <dbl>
#>  1 Austria AUT   Men    2000        0.0514
#>  2 Austria AUT   Men    2001        0.0520
#>  3 Austria AUT   Men    2002        0.0525
#>  4 Austria AUT   Men    2003        0.0529
#>  5 Austria AUT   Men    2004        0.0532
#>  6 Austria AUT   Men    2005        0.0534
#>  7 Austria AUT   Men    2006        0.0535
#>  8 Austria AUT   Men    2007        0.0536
#>  9 Austria AUT   Men    2008        0.0536
#> 10 Austria AUT   Men    2009        0.0535
#> # ... with 425 more rows
  • And another with data about women:
#> # A tibble: 435 x 5
#>    country ISO   sex    year prop_diabetes
#>    <chr>   <chr> <chr> <dbl>         <dbl>
#>  1 Austria AUT   Women  2000        0.0353
#>  2 Austria AUT   Women  2001        0.0353
#>  3 Austria AUT   Women  2002        0.0352
#>  4 Austria AUT   Women  2003        0.0351
#>  5 Austria AUT   Women  2004        0.0350
#>  6 Austria AUT   Women  2005        0.0348
#>  7 Austria AUT   Women  2006        0.0345
#>  8 Austria AUT   Women  2007        0.0342
#>  9 Austria AUT   Women  2008        0.0339
#> 10 Austria AUT   Women  2009        0.0335
#> # ... with 425 more rows
  • We renamed some columns to be more consistent across our datasets.

  • Finally, we grouped the observations by country to get the mean diabetes prevalence between 2000 and 2013 for each European country:

  • For men:

#> # A tibble: 29 x 2
#>    country_code prop_men_diabetes
#>    <chr>                    <dbl>
#>  1 AUT                     0.0532
#>  2 BEL                     0.0573
#>  3 BGR                     0.0735
#>  4 CHE                     0.0498
#>  5 CYP                     0.0769
#>  6 CZE                     0.0778
#>  7 DEU                     0.0587
#>  8 DNK                     0.0546
#>  9 ESP                     0.0839
#> 10 EST                     0.0712
#> # ... with 19 more rows

  • For women:

#> # A tibble: 29 x 2
#>    country_code prop_women_diabetes
#>    <chr>                      <dbl>
#>  1 AUT                       0.0340
#>  2 BEL                       0.0386
#>  3 BGR                       0.0640
#>  4 CHE                       0.0301
#>  5 CYP                       0.0561
#>  6 CZE                       0.0651
#>  7 DEU                       0.0399
#>  8 DNK                       0.0351
#>  9 ESP                       0.0588
#> 10 EST                       0.0641
#> # ... with 19 more rows
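
The split by sex and the per-country averaging could look roughly like this. It is a minimal sketch with dplyr: the four Austrian values are copied from the outputs above, and only two years are included to keep the example short.

```r
library(dplyr)

# Austrian values for 2000 and 2001, copied from the outputs above.
diabetes <- tibble(
  country = "Austria",
  ISO     = "AUT",
  sex     = c("Men", "Men", "Women", "Women"),
  year    = c(2000, 2001, 2000, 2001),
  prop_diabetes = c(0.0514, 0.0520, 0.0353, 0.0353)
)

# One subset per sex, then one mean prevalence per country.
men_means <- diabetes %>%
  filter(sex == "Men") %>%
  group_by(country_code = ISO) %>%
  summarise(prop_men_diabetes = mean(prop_diabetes))

women_means <- diabetes %>%
  filter(sex == "Women") %>%
  group_by(country_code = ISO) %>%
  summarise(prop_women_diabetes = mean(prop_diabetes))

print(men_means)
print(women_means)
```

Naming the key column country_code in both results keeps it consistent with the other two tables.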

Joining tables

Finally, we joined all three tables into one dataset using ‘country_code’ as the key:

#> # A tibble: 29 x 10
#> # Groups:   country_name [29]
#>    country_name   country_code avg_gdp prop_men_diabetes
#>    <chr>          <chr>          <dbl>             <dbl>
#>  1 Austria        AUT            336.             0.0532
#>  2 Belgium        BEL            407.             0.0573
#>  3 Bulgaria       BGR             37.4            0.0735
#>  4 Croatia        HRV             48.1            0.0713
#>  5 Cyprus         CYP             20.2            0.0769
#>  6 Czech Republic CZE            158.             0.0778
#>  7 Denmark        DNK            275.             0.0546
#>  8 Estonia        EST             16.5            0.0712
#>  9 Finland        FIN            217.             0.0657
#> 10 France         FRA           2283.             0.0709
#> # ... with 19 more rows, and 6 more variables:
#> #   prop_women_diabetes <dbl>, cal_prot_animal <dbl>,
#> #   cal_prot_plant <dbl>, cal_carbs <dbl>, cal_fat <dbl>,
#> #   total_consumption <dbl>
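
A join of this kind might be written as below. This is only a sketch: the GDP and diabetes values for Austria and Belgium are the rounded values printed above, while the total_consumption values and the extra "XXX" row are made up to show what happens to a country that is missing from the other tables.

```r
library(dplyr)

# Values for AUT and BEL as printed above (rounded).
gdp_avg <- tibble(
  country_name = c("Austria", "Belgium"),
  country_code = c("AUT", "BEL"),
  avg_gdp      = c(336, 407)
)
men_means   <- tibble(country_code = c("AUT", "BEL"),
                      prop_men_diabetes = c(0.0532, 0.0573))
women_means <- tibble(country_code = c("AUT", "BEL"),
                      prop_women_diabetes = c(0.0340, 0.0386))

# Made-up calorie totals; "XXX" is a hypothetical unmatched code.
diet_means <- tibble(country_code = c("AUT", "BEL", "XXX"),
                     total_consumption = c(3600, 3700, 3000))

# inner_join keeps only the countries present in every table.
joined <- gdp_avg %>%
  inner_join(men_means,   by = "country_code") %>%
  inner_join(women_means, by = "country_code") %>%
  inner_join(diet_means,  by = "country_code")

# Rows of diet_means that found no match, useful as a sanity check.
unmatched <- anti_join(diet_means, gdp_avg, by = "country_code")

print(joined)
print(unmatched)
```

Checking the unmatched rows with anti_join() is a cheap way to confirm that no country was silently dropped by the join.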

Exploratory data analysis

Even though we answer our questions using the means of the variables, it is interesting to first observe how they evolve in each country over time. We started with GDP.

GDP per country

Plotting GDP against Diabetes (Men & Women)

However, we see that apart from 5 outliers, our observations are mostly bunched up at the left of the graph. We decided to exclude these 5 observations to see whether a trend emerges among the other countries. These outliers, as observed in the previous graph, are the countries whose GDP increased sharply over the 2000-2013 period.

Without the outliers, the picture is a bit clearer: it seems that the richer a country is, the lower its diabetes prevalence tends to be. But what about the 5 richest countries in Europe?
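
One possible way to exclude the highest-GDP countries and compare the relationship with and without them is sketched below. The data are the ten rows printed in the joined table above (rounded), and drop_top_gdp is a hypothetical helper, not a function from our analysis. The correlation coefficient is used here as a simple numeric summary in place of the plots.

```r
library(dplyr)

# The ten rows printed in the joined table above (avg_gdp in billions of US$).
joined <- tibble(
  country_code = c("AUT", "BEL", "BGR", "HRV", "CYP",
                   "CZE", "DNK", "EST", "FIN", "FRA"),
  avg_gdp = c(336, 407, 37.4, 48.1, 20.2, 158, 275, 16.5, 217, 2283),
  prop_men_diabetes = c(0.0532, 0.0573, 0.0735, 0.0713, 0.0769,
                        0.0778, 0.0546, 0.0712, 0.0657, 0.0709)
)

# Hypothetical helper: drop the n countries with the largest average GDP.
drop_top_gdp <- function(data, n) {
  outliers <- slice_max(data, avg_gdp, n = n)
  anti_join(data, outliers, by = "country_code")
}

# n = 1 here because only France stands out in this ten-row excerpt;
# the report excludes 5 countries from the full 29-row table.
without_outliers <- drop_top_gdp(joined, n = 1)

# Correlation between GDP and male diabetes prevalence, with and without.
cor_all  <- cor(joined$avg_gdp, joined$prop_men_diabetes)
cor_rest <- cor(without_outliers$avg_gdp, without_outliers$prop_men_diabetes)

print(c(all = cor_all, without_outliers = cor_rest))
```

Selecting outliers by rank rather than by a fixed GDP threshold keeps the helper independent of the unit in which GDP is expressed.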

Evolution of calories in each country

First, we looked for a trend in the consumption of macronutrients in the 2000s for each country in our sample by plotting its evolution over time:

There does not seem to be any trend in the graphs above, and diets are rather stable in each country, which is why we take the average consumption of each macronutrient. We can however note that the 5 outliers mentioned before have a higher fat consumption than the countries with a smaller GDP.

GDP against total calories

We now plot the consumption of each macronutrient and the total consumption to see if there is a trend:

We see that total consumption does not really change. We then plot the total consumption against each country’s GDP.

We again end up with the same 5 outliers, which have a higher-than-average GDP. If we remove them, we have:

Now a trend is clearly visible.

Evolution of Diabetes over time

To begin with, we looked for a trend in diabetes prevalence in the 2000s for each country in our sample by plotting its evolution over time:

We see right away that the prevalence of diabetes is higher for men than for women across all countries (with two exceptions: Romania between 2000 and 2003 and Slovenia between 2000 and 2006). Moreover, prevalence decreases over time in these countries:

  • Belgium
  • Denmark
  • Finland

In the following countries, prevalence decreases over time for women but not for men:

  • Austria
  • Malta
  • Netherlands
  • Germany
  • Italy
  • Spain
  • Switzerland

In the other European countries, the prevalence of diabetes is increasing over time, at different paces.
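
The classification above was read off the plots. As a complementary check, one could compute a least-squares slope of prevalence on year for each country and sex, as in the sketch below. This is a swapped-in numeric summary rather than the method used in the report; the Austrian values for 2000-2009 are copied from the outputs in the Diabetes section.

```r
library(dplyr)

# Austrian values for 2000-2009, copied from the outputs in the Diabetes section.
austria <- tibble(
  country_code = "AUT",
  sex  = rep(c("Men", "Women"), each = 10),
  year = rep(2000:2009, times = 2),
  prop_diabetes = c(
    0.0514, 0.0520, 0.0525, 0.0529, 0.0532,
    0.0534, 0.0535, 0.0536, 0.0536, 0.0535,
    0.0353, 0.0353, 0.0352, 0.0351, 0.0350,
    0.0348, 0.0345, 0.0342, 0.0339, 0.0335
  )
)

trends <- austria %>%
  group_by(country_code, sex) %>%
  summarise(
    # Least-squares slope of prevalence on year: cov(x, y) / var(x).
    slope = cov(year, prop_diabetes) / var(year),
    .groups = "drop"
  ) %>%
  # A single overall slope is a coarse summary; it ignores changes of direction.
  mutate(trend = if_else(slope < 0, "decreasing", "increasing"))

print(trends)
```

For Austria, the sign of the slope agrees with the reading of the plot: a decrease for women but not for men.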

Plotting Diabetes against each type of calorie consumption

We now plot diabetes prevalence against total calorie consumption and against each type of calories consumed:

We can see a negative trend for total consumption and for calories from animal protein, and a positive trend for calories from plant protein. For calories from carbohydrates, we can see a slightly positive trend for women.

We now check whether the trends differ when we remove our 5 outliers:

Without our 5 outliers, the trends change little for each type of calories consumed, except for carbohydrates, where the trend for men changes and becomes slightly positive.
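
To compare the direction of these trends with and without the outliers in numbers rather than by eye, one option is to reshape the joined table to long format and compute one slope per macronutrient and sex. The sketch below assumes the column names of our joined table; all values are made up, slopes_by_nutrient is a hypothetical helper, and "DDD" plays the role of an outlier.

```r
library(dplyr)
library(tidyr)

# Made-up values for four hypothetical countries.
joined <- tibble(
  country_code        = c("AAA", "BBB", "CCC", "DDD"),
  prop_men_diabetes   = c(0.080, 0.070, 0.060, 0.050),
  prop_women_diabetes = c(0.060, 0.055, 0.050, 0.045),
  cal_prot_animal     = c(200, 220, 240, 260),
  cal_prot_plant      = c(180, 170, 160, 150)
)

# Hypothetical helper: slope of prevalence on calories per nutrient and sex.
slopes_by_nutrient <- function(data) {
  data %>%
    pivot_longer(cols = starts_with("cal_"),
                 names_to = "nutrient", values_to = "calories") %>%
    pivot_longer(cols = starts_with("prop_"),
                 names_to = "sex", values_to = "prevalence") %>%
    group_by(nutrient, sex) %>%
    # Least-squares slope: cov(x, y) / var(x).
    summarise(slope = cov(calories, prevalence) / var(calories),
              .groups = "drop")
}

slopes_all  <- slopes_by_nutrient(joined)
slopes_rest <- slopes_by_nutrient(filter(joined, country_code != "DDD"))

print(slopes_all)
print(slopes_rest)
```

A change of sign between the two tables for a given nutrient and sex would correspond to the kind of reversal described above for carbohydrates.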